Support bundle checkpoint / preemptible workers #3882
codalab/model/bundle_model.py
Outdated
@@ -1000,15 +1010,20 @@ def transition_bundle_worker_offline(self, bundle):
    # The user deleted the bundle or the bundle finished
    return False

if getattr(bundle.metadata, "preemptible", False):
Why do we have to use getattr here instead of accessing the preemptible field directly?
I'm worried that if we do bundle.metadata["preemptible"], this will break bundles that started running under the previous deploy (and are still running during this deploy), since they don't have this metadata key yet.
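To illustrate the concern, here is a minimal sketch. It assumes metadata keys are exposed as attributes on the metadata object; `FakeMetadata` and `is_preemptible` are hypothetical names used only for illustration, not CodaLab's actual classes or helpers.

```python
class FakeMetadata:
    """Hypothetical stand-in for a bundle's metadata object."""

    def __init__(self, **fields):
        # Expose each metadata key as an attribute.
        for key, value in fields.items():
            setattr(self, key, value)


# A bundle created before the deploy has no "preemptible" key at all.
old_bundle_metadata = FakeMetadata(name="run-old")
# A bundle created after the deploy carries the new key.
new_bundle_metadata = FakeMetadata(name="run-new", preemptible=True)


def is_preemptible(metadata):
    # A missing key falls back to False instead of raising AttributeError.
    return getattr(metadata, "preemptible", False)
```

With direct access (`old_bundle_metadata.preemptible`), the older bundle would raise `AttributeError`, whereas the `getattr` default treats it as not preemptible.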
I might have missed it, but I didn't see any worker-side logic for how bundles are resumed. How will the resumed bundle be run in the same working directory as its previous incarnation?
I do think the bundle needs a preemptible flag, because in some cases you don't want your job to be resumed from a partially completed working directory, since that would produce invalid results. In those cases, the bundle should not be restaged. Otherwise, I'm afraid we'll end up in infinite loops.
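The rule described here could be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the function name and the state strings `"staged"` and `"worker_offline"` are placeholders chosen for the example.

```python
def next_state_after_worker_offline(is_preemptible: bool) -> str:
    """Hypothetical helper: pick the next state for a running bundle
    whose worker has gone offline."""
    if is_preemptible:
        # The user opted in: the job can safely resume from a partially
        # completed working directory, so it is restaged.
        return "staged"
    # The user did not opt in: resuming could yield invalid results,
    # so the bundle is not restaged.
    return "worker_offline"
```

Keeping the opt-in explicit means a bundle is only ever rerun when its owner has declared that a partial working directory is safe to continue from.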
@percyliang @teetone One more thing: with this PR, the bundle stats (such as time taken) only include the time for the last worker that ran a bundle, rather than the total time across all workers that ran it. We might want to update that in the future, but it's nice-to-have functionality that could take longer to implement.
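For reference, the difference between the two ways of reporting time can be sketched like this. The per-worker durations and both function names are hypothetical; this PR does not track per-worker times in this form.

```python
from typing import List


def reported_time_last_worker(per_worker_seconds: List[float]) -> float:
    # Behavior described above: only the last worker's run time is reported.
    return per_worker_seconds[-1] if per_worker_seconds else 0.0


def reported_time_all_workers(per_worker_seconds: List[float]) -> float:
    # Possible future behavior: total time across every worker that ran the bundle.
    return sum(per_worker_seconds)
```

For a bundle that ran 120 seconds on one worker, was preempted, and then ran 300 seconds on another, the first function reports 300 seconds and the second reports 420 seconds.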
Went over the PR offline with Ashwin
For documentation on how to use this feature, see: https://github.com/codalab/codalab-worksheets/blob/6423cc3ba1e6d9779224c6c66e97b59a9c208197/docs/Checkpoints.md
Original design doc: https://docs.google.com/document/d/1ifcuXLPldSWJehZgl9RTmPogllbyUJ_vRIOiW2BF1i0/edit#
Checklist: